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Abstract 

This paper describes algorithms for nonnegative matrix factorization (NMF) with the /3-divergence 
(/3-NMF). The /3-divergence is a family of cost functions parametrized by a single shape parameter 
j3 that takes the Euclidean distance, the KuUback-Leibler divergence and the Itakura-Saito diver- 
gence as special cases (/? = 2, 1,0 respectively). The proposed algorithms are based on a surrogate 
auxiliary function (a local majorization of the criterion function). We first describe a majorization- 
minimization (MM) algorithm that leads to multiplicative updates, which differ from standard heuris- 
tic multiplicative updates by a /3-dependent power exponent. The monotonicity of the heuristic algo- 
rithm can however be proven for /3 G (0, 1) using the proposed auxiliary function. Then we introduce 
the concept of majorization-equalization (ME) algorithm which produces updates that move along 
constant level sets of the auxiliary function and lead to larger steps than MM. Simulations on syn- 
thetic and real data illustrate the faster convergence of the ME approach. The paper also describes 
how the proposed algorithms can be adapted to two common variants of NMF: penalized NMF (i.e., 
when a penalty function of the factors is added to the criterion function) and convex-NMF (when 
the dictionary is assumed to belong to a known subspace). 

Keywords: Nonnegative matrix factorization (NMF), /3-divergence, multiplicative algorithms, 
majorization-minimization (MM), majorization-equalization (ME). 



1 Introduction 

Given a data matrix V of dimensions F x N with nonnegative entries, NMF is the problem of finding a 
factorization 

V w WH (1) 

where W and H are nonnegative matrices of dimensions F x K and K x N, respectively. K is usually 
chosen such that FK + KN^FN, hence reducing the data dimension. The factorization is in general 
only approximate, so that the terms "approximate nonnegative matrix factorization" or "nonnegative 
matrix approximation" also appear in the literature. NMF has been used for various problems in diverse 
fields . To cite a few , let u s mention the problems of learni ng parts of faces and s eman tic features of 
text (|Lee and Seung . polyphonic mu sic transcrip t ion (jSmaragdis and Brownl . |2003|). object char 



acterization by reflectance spectra anal ysis (iBerrv et al. . 2007), portfolio div ersification ( Drakakis et al 
[20071) . DNA gene expression analysis (jBrunet et al.l . l2004t iGao and Church. 2005.1. c luster ing of pro 



tein interactions (jCreene et al. . 2008 ). image denoising and inoainting (jMairal et al. . 2010l ). etc. The 



factorization ([T]) is usually sought after through the minimization problem 

min i:>(V|WH) subject to W > 0,H > 



(2) 



where the notation A > expresses nonnegativity of the entries of matrix A (and not semidefinite 
positiveness), and where D(V|WH) is a separable measure of fit such that 



N 



D{V\Wn) = J2Y. rf([V]/„|[WH]/„) 



(3) 



/=ln=l 



1 



where d{x\y) is a scalar cost function. What we intend by "cost function" is a positive function of y G 
given X G M+j with a single minimum for x = y. 



A p opular cost function i n NM F is the /3-divergence dj3{x\y) of Basu et al. (jl998[) : Eguchi and Kanol 



( 2001 ): Cichocki and Amari ( 2010l ). defined rigorously in Section [2TT1 In essence, it is a parameterized 
cost function with a single parameter /3, which takes the Euclidean distance, the generalized KuUback- 
Leibler (KL) divergence and the Itakura-Saito (IS) divergence as special cases {(5 — 2,1 and 0, respec- 
tively). NMF with the /3-divergence has been widely used in music signal processing in particular, for 



transcription and s ource separation ( O'Gradyl. 120071: lO'Gradv and Pearlmutterl 12008 



Vincent et al.l 



Dessein et al.. 201 



2009[lBertin et al.l . l2009HFevotte et al.l . l2009t 
2010r i. In these work the nonnegative data matrix V is a spectrogram which is decomposed into ele 



20inl: 



a 



FitzGerald et al 



Henneauin et al 



mentary spectra with NMF. The parameter /3 can be tuned so as to optimize transcription or separation 
accuracy on training data. While popular in music signal processing, NMF with the /3-divergence (short- 
ened as "/3-NMF" in the rest of the paper) can be of interest to any field: the parameter (3 essentially 
controls the assumed statistics of the ob servation noise and can eith er be fixed or learnt from training 
data or by cross-validation. As noted bv iFevotte and Cemgill (j2009l) . the values /3 = 2, 1,0 respectively 
underly Gaussian additive, Poisson and multiplicative Gamma observation noise. The / 3-divergence offers 
a continuum of noise statistics that interpolate between these three spec i fic ca ses, see ( Basu et al.l . Il998t 
Eguchi and Kanol l200ll : iMinami and Eguchi l2002t ICichocki and Amaril . |2010| ). 



The standard /3-NMF algorithm used in the above-mentioned papers is presented as a gradient-descent 
algorithm where the st ep size is set adaptive ly and chosen such that the updates are multiplicative, as 
originally described by Cichocki et al .* ('2006'). The same algorithm can be derived from the following 



heuristic, proposed bv iFevotte et al. ()2009il . Let 6* be a coefficient of W or H. As will be seen later, 
when using the /3-divergence the derivative VoD{9) of the criterion Z)(V|WH) with respect to (w.r.t) 6 
can be expressed as the difference of two nonnegative functions such that \/gD{9) = D{9) — Vg D{9). 
Then, a heuristic multiplicative algorithm simply writes 



■V+D{9) 



(4) 



which ensures nonnegativity of the parameter updates, provided initialization with a nonnegative value. 
It produces a descent algorithm in the sense that 9 is updated towards left (resp. right) when the gra- 
dient is positive (resp. negative). A fixed point 9* of the algorithm i mplie s either VgD{9*) = or 
9* = Q. Monotonicity of this algorithm has been proven by iKompassI (|2007l ) for the specific range of 
values of /3 for which the divergence dp{x\y) is convex w.r.t y (i.e., /3 G [1,2], see Section [2.1^ . The 
proof is based on a majorization-minimization (MM) procedure: an auxiliary function is built and it- 
eratively minimized for each column of H (given W) and each row of W (given H). The auxiliary 
function is built using Jensen's inequality, thanks to convexity of the cost for /3 G [1,2]. However, it was 
observed in practice that the multiplicative algorithm (|4]) is still monotone (i.e., decreases the criterion 
function at each iteration) for values of /3 out of the "convexity" interval [1, 2], though no proof is to avail. 

This paper studies three descent algorithms for /3-NMF, based on an auxiliary function that unifies ex- 
isting auxiliary functions for the Euclid ean distance and KL divergence ( De Pierro . 1993 : Lee and Seund . 
200 1[ ). the "generalized divergence" of iKompassI (j2007|) and th e IS divergenc e (ICao et al lll999|). This 



auxiliary function was also recently proposed independently by Nakano et al. ( 2010l ). The construction 



of the auxiliary function relies on t he decomp o sition of the criterion function into its convex and con- 
cave parts, following the approach o f[c ao et al. for the IS divergence. An auxiliary function to the 
convex part is constructed using Jensen's inequality while the concave part is locally majorized by its tan- 
gent. It is shown that MM algorithms based on the latter auxiliary function yield multiplicative updates 
that coincide with the heuristic described by Eq. dp for /3 G [1, 2 ] but differ from a /3-dependent power 
exponent when /3 ^ [1,2], a result also obtained by Nakano et al.l ( 2010t ). Additionally, we show that the 
monotonicity of the heuristic algorithm can however be proven for /3 G (0, 1), using the proposed auxil- 
iary function (it is shown to produce a descent algorithm though it does not fully minimize the auxiliary 
function). Then we introduce the concept of maximization- equalization (ME) algorithm which produces 
updates that move along constant level sets of the auxiliary function and leads to larger steps than MM. 
This is akin to overrelaxation and is shown experimentally to produce faster convergence. Finally we show 
how the described MM, ME and heuristic algorithms can be adapted to two common variants of NMF: 
penalized NMF (i.e., when a penalty function of W or H is added to the criterion func tion) and "convex " - 
NMF (when the dictionary is assumed to belong to a known subspace, as proposed bv lPing et al. I (|201(l ^. 
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Tabic 1: Example of differentiable convex-concave-constant decomposition of the /3-divergence under the 
form ©. 



The paper is organized as follows. Section [5] defines and discusses the /3-divergence, and then exposes 
in details the optimization task addressed in this paper. Section[3]recalls the concept of auxiliary function 
and then introduces a general auxiliary function for the /3-NMF problem. Section H] describes algorithms 
based on the proposed auxiliary function, namely MM and ME algorithms, and describe how they relate 
to the heuristic update (|4|). Section[5]reports simulations and convergence behaviors on synthetic and real 
data (with audio transcription and face interpolation examples). Section [5] describes extensions of the 
proposed algorithms to penalized and convex- NMF. Section [7] concludes and discusses open questions. 



2 Preliminaries 



In this section wc present the /3-divergence and more precisely sp ecify the task that is addressed in this 
paper. A detailed exposition of the /3-divergence can be found in ( Cichocki and Amari . 2010f ). 



2.1 Definition of the /^-divergence 

The /3-divergence was introduced by Basu et al. (1998) and Eguchi and Kanol ( 2001 ) and can be defined 



dii(x\y) 



def 



/3(/3-l) 



{xf^ ^{fi - \)y^ - ^xy 

+ y 

log ? - 1 



X log - 

t> y 



(5) 



13 e M\{0,1} 
/3 = 1 

y ^ y ^ 

Basu et al. (1998) and Eguchi and Kano (12001) a ssume /3 > 1, but the definition domain can be extended 
to /3 e K, as suggested bv ICichocki et al.l (|2006l ). which is the definition domain that is considered in 
this paper. The /3-divergence can be shown continuous in /3 by using the identity lim^^o (2:^ — y^)l P = 
\og{x/y). The limit cases /3 = and /3 = 1 correspond to the IS and KL div ergences, respect ively. The 
/3-divergence coincides up to a factor with the "generalized divergence" of lKompas^ (|2007[ ) which, in 
the context of NMF as well, was separately constructed so as to interpolate between the KL divergence 
(/3 = 1) and the Euclidean distance (/3 = 2). The /3-divergence is plotted for various values of /3 on 
Figured] Note that in this paper we will abusively refer to (i^=2 — {x ~ 2/)^/2 as the Euclidean distance, 
though the latter is formally defined with a square root, and for vectors. 

The first and second derivative of d/3{x\y) w.r.t y arc continuous in /3, and write 



4(x|y)=/-2(y-.T), 

p _ i)y _ (/3 _ 2)x] 



(6) 
(7) 



The derivative shows that (i^(a;|y), as a function of y, has a single minimum in y = x and that it increases 
with jj/ — ccj, justifying its relevance as a measure of fit. The second derivative shows that the /3-divcrgence 
is convex w.r.t y for /3 G [1,2]. Outside this interval the divergence can always be expressed as the sum 
of a convex, concave and constant part, such that 



dp{x\y) = d{x\y) + d{x\y) + d{x) 



(8) 



where d{x\y) is a convex function of y, d{x\y) is a concave function of y and d{x) is a constant of y. The 
decomposition is not unique, since constant or linear terms (w.r.t y) are both convex and concave, or, less 
trivially, since any convex term can be added to d[x\y) while subtracted from d{x\y). In the following 
we will use the "natural conventions " given in Table [T] 

As noted by Fevotte et al. ( 20091 ). a noteworthy property of the /3-divergence is its behavior w.r.t to 
scale, as the following equation holds for any value of /3: 



dfi{\x\\y) = A'^ dp{x\y). 



(9) 
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(a) /3 < (b) < /? < 1 




Figure 1: /3-divergence as a function of y (with x = 1). The subfigures illustrate the regimes of 

the /3-divergence for its five characteristic ranges of values of /?. The divergence is convex for 1 < /3 < 2, 
as seen on subfigures (c) and (d). On the other subfigures, the inflection points are indicated with vertical 
dotted lines. For /3 < 0, the divergence possesses horizontal asymptotes of coordinate x^/{f3{l3 — 1)) as 
y — !• cx). For /3 > 1, the divergence takes finite value / {I3{f3 — 1)) at y = 0, where the derivative is zero 
for /? > 2. 



It implies that factorizations obtained with /3 > (such as with the Euclidean distance or the KL 
divergence) will rely more heavily on the largest data values and less precision is to be expected in the 
estimation of the low-power components, and conversely factorizations obtained with (3 < will rely more 
heavily on smallest data values. The IS divergence (/3 = 0) is scale- invariant, i.e., dis{Xx\Xy) = dis{x\y), 
and is the only one in the family of /3-divergences to possess this property. Factorizations with small 
positive values of /S are relevant to decomposition of audio spectra, which typically exhibit exponential 
power decrease along frequency / and also usually comprise low-power transient components such as 
note attacks togethe r with higher power components such as tonal parts of sustained notes. For example, 
Fevotte et al. ( 20091 ) present the results of the decomposition of a piano power spectrogram with IS-NMF 



and show that components corresponding to very low residual noise and hammer hits on the strings are 
extracted with great accuracy, while these components are either ignored or severely degraded whe n 
using Euclidean or KL divergences. Similarly, the value /? = 0.5 is advocated bv lFitzGerald et"al ( 2009f ): 
Henneauin et al. ( 2010h and has b een shown to g i ve op timal results in music transcription based on NMF 



Menneaum et al.l (|^U1UI I and has b een shown to g i ve op ti 
of the magnitude spectrogram bv lVincent et al.l ( 2010[) . 



The /3-divergence belongs to the family of Bregman divergence s. For /3 ^ |0. 1), a s uitable Bregman 
generating function is (f>{y) ~ y^ /{j3{l3 — 1)), as noted bv , Fevotte and Cemgfl ( 2009f ). This function, 
however, cannot generate the IS and KL divergences by continuity when j3 tends to or 1. The latter 



4 



divergences may non etheless be generated "sepa rately" , using the functions = — log y and 4>{y) = 
ylogy, respectively. ICichocki and Amari ( 2010f ) give a general Bregman generating function of the /3- 
divergence, defined for all /3 e R, in the form of 0/3/o,i(2/) = {u^ — Pu + P — — 1)), (jip=o{y) = 

y — log?/ — 1 and (t>g= i(y) ~ y^ogy — y + 1. NMF with Bregman divergences has been considered by 
Dhillon and where the lack of results about the monotonicity of multiplicative algorithms in 



general has been notedly This paper fills this gap for the specific case of /^-divergence. 



2.2 Task 

Core optimization problem As to our best knowledge all algorithms in the literature to date, the 
NMF algorithms we describe in this paper sequentially update H given W and then W given H. These 
two steps are essentially the same, by symmetry of the factorization (V « WH is equivalent to V-'" « 
H^W"^ and the roles of W and H are simply exchanged), and because we are not making any assumption 
on the relative values of F and N . Hence, we may concentrate on solving the following subproblem 

min C(H) =^ i:>(V|WH) subject to H > (10) 

H 

with fixed W and where in the rest of the paper D(V|WH) is as of Eq. ^ with d{x\y) = d^(a;|?/). The 
criterion function C(H) separates into J^n -C'(v„|Wh,i), where v„ and h„ are the n*'' row of V and H, 
respectively, so that we are essentially left with solving the problem 

min C(h) ^ D(v|Wh) subject to h > (11) 

h 

where V e M^, W e RI""^ and h e R^. 



KKT necessary conditions An admissible solution h* to problem pT|) must satisfy the Karush- 
Kuhn- Tucker (KKT) first order optimality conditions, which write 

VhC(h^).h* = (12) 

VhC(h^) > (13) 

h* > (14) 

where the dot notation '.' denotes entrywise operations (here term-to-term multiplication) and VhC(h) 
denotes the gradient of C(h), given by 

y^cih)^w^[d'ivf\[Wh]f)]f (15) 

= W^[(Wh)-(^-2)(Wh- v)] (16) 

where the notation [xf]f refers to the column vector [xi, . . . ,xf]'^- The KKT conditions (|12p - P^ can 
be summarized as 

min{h^VhC(h*)}-0^f (17) 
where the "min" operator is entrywise and Ok is a null vector of dimension K. 



Algorithms In the following, we will say that an algorithm is monotone if it produces a sequence 
of iterates {h(*)}i>o, such that C(h*^*+^') < C(h('^) for all i > 0. An algorithm is said convergent if 
it produces a sequence of iterates {h*^*)}i>o which converges to a limit point h"^ satisfying the KKT 
conditions p2p - p4[) . Monotonicity docs not imply convergence in general, nor is monotonicity necessary 
to convergence. 



3 An auxiliary function for /3-NMF 

In this section we properly define the concept of auxiliary function and then exhibit a separable auxiliary 
function for the /3-NMF problem. 

^Morc precisely. iDhillon and Sra (2005) give proofs of monotonicity for the "reverse" problem of minimizing D(WH|V) 
instead of D(V| WH), while pointing that monotonicity of multiplicative algorithms based on the heuristic Q for the latter 
problem is however observed in practice. 
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Figure 2: The /3-divergeiice d^(a;|y) for /3 = 0.5 (with x = 1) and its auxihary function in dimension 
one (with h = 2.2). The MM update corresponds to the minimum of the auxiliary function, 

see Section 14.11 The heuristic update given by Eq. is discussed in Section 14.21 (the heuristic 
update minimizes the criterion function in the simple one-dimensional case but this is not true in larger 
dimensions). The ME update consists in selecting the next update "beyond the valley" defined by 
the auxiliary function, from the current solution h, sec Section [4731 



3.1 Definition of auxiliary function 



Definition 1 (Auxiliary function) . The . 
to C(h) if and only if 

. Vh e , C(h) = G(h|h) 



V(h,h) e 



, C(h) < G(h|h) 



mapping G(h|h) is said to be an auxiliary function 



In other words an auxiliary function G'(h|h) is a majorizing function or upper bound of C(h) which is 
tight for h = h. The optimization of C{h) can be replaced by iterative optimization of G(h|h). Indeed, 
any iterate h(*+^) satisfying 



G'(h(''+i)|h(')) < G(h«|h«) 
satisfies G(h(*+i)) < G(h(*)), because we have 

G(h(*+i)) < G(h(*+i)|hW) < G(h«|h«) = G(h(' 
The iterate h^'+-'^' is typically chosen as 

h('+^) = argmin G(h|h(*)) 

h>0 



(18) 



(19) 



(20) 



which forms the basis of maximization-minimization (MM) algorithms, see, e.g., ([Hunter and Lange 
20041 ) for a tutorial. However, any other iterate h(*+^^ satisfying ([T5| produces a monotone algorithm. 
As such. Figure [2] illustrates the three updates strategies that will be developed in this paper. 
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3.2 Separable auxiliary function for /3-NMF 

In this section wc construc t an auxiliary fun ction to C(h) for the specific case of the /3-divergence. Our 
approach follows the one of Cao et al.l ( 19991 ) for IS divergence, and consists of majorizing the convex part 
of the criterion using Jensen's inequality and majorizing the concave part by its tangent, as detailed in 
the proof of the following theorem. Here and henceforth, we denote Wh by v, with entries [Wh]/ = w/. 

Theorem 1 (Auxiliary function for /3-NMF). Let h be such that 

(i) V/,S/ >0 

(ii) Vfc, hk > 

Then, the function G'(h|h) defined by 



f 



E 



d Vf\Vf^ 



d'ivf\vf)J2^fk{hk 
k 



hk) + d{vf\if) 



divf) (21) 



is an auxiliary function to C(h) = J2f d'{vf\[^i>]f), where d{x\y) + d{x\y) + d{x) is any differentiable 
convex-concave-constant decomposition of the /3-divergence, such as the one defined in Table [TJ 



Proof. The condition G'(h|h) = C(h) is trivially met. The criterion C(h) may be written as 

C7(h)=^Q(h) 
/ 



(22) 



def 



where C/(h) — (i(w/|[Wh]/). We prove C(h) < G(h|h) by constructing an auxiliary function to each 
part C/(h) of the criterion, and more precisely by treating the convex and concave part separately. Let 

us define C/(h) d{vf\[Wh]f) and C/(h) d{vf\['Wh]f), so that we can write 



Cf{h)=Cf{h) + Cf{h)+d{vf). 



(23) 



Convex part: We first prove that 



Gf{h\h) = J2 



■Wfkhk ■-, f ... hk 
— d Vf\vf — 

Vf \ hk 



(24) 



is an auxiliary function to C/(h). The condition G/(h|h) = C'/(h) is trivially met. The condition 
G'/(h|h) > C/(h) is proven as follows. Let K. be the set of indices k such that Wfk ^ 0. Define Vfc S K,, 



Xfk 



Wfkhk _ Wfkhk 



(25) 



We have Yl,keK ^fk = ^ and 



Gf{h\h) = ^~Xfk d 



Vf 



Wfkhk 



keK 



A 



/fc 



Wfkhk 



>d[vf\Y,h- , 
\ keK ^fk 

/ K \ 



d Vf\^Wfkhk 



k=l 



= Cfih) 

where we used Jensen's inequality, by convexity of d{x\y). 



(26) 

(27) 

(28) 
(29) 



Concave part: An auxiliary function G'/(h|h) to the concave part C /(h) can be taken as the first order 
Taylor approximation to C/(h) in the vicinity of h, i.e.. 



G/(h|h) = C/(h)-t-V^C/(h) (h-h). 



(30) 
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The function satisfies G/(h|h) = Cf{h) by construction and G/(h|h) > C/(h) by concavity of C/(h), 
using the property that the tangent to any point is an upper bound of a concave functionl^ Using 



VH,Cf{h)=Wfkd'{vf\[Wh]f) 
the exphcit form for G'/(h|h) is given by 

G/(h|h) = d{vf\if) + d'{vf\if)Y,Wfk{hk~hk). 



(31) 



(32) 



In the end a suitable auxiliary function G'(h|h) to C(h) is obtained by summing up the auxiliary 
functions constructed for each individual part of the criterion, i.e., 



G(h|h) = J2 (G/(h|h) + G/(h|h) + divf)) 



(33) 



which leads to Eq. (|2T 



□ 



Properties of the auxiliary function G(h|h) is by construction separable in functions of the indi- 
vidual coefficients hk of h, which allows to decouple the optimization. It is convenient to rewrite the 
auxiliary function as such in order to derive some of the algorithms of Section ([4]). We may write 



G(h|h)==^Gfe(/i,|h) + cst 



(34) 



where est is a constant w.r.t h and 



Gk{hk\h)'^^^hk 



EWfk J / I - "-fc 



The gradient of the auxiliary function is given by 

V,,G(h|h) = ^«;;, 
/ 



+ hk 



hk 



J2wfkd'ivf\vf) 
/ 



d' ( ^f\^fj^ ) + d'{vf\vf) 



(35) 



(36) 



Thanks to the separability of the auxiliary function into its variables the Hessian matrix is diagonal with 



VlG{h\h)=J2if^d''(vf\i/{ 



i~ hk 



(37) 



By convexity of d(x\y) we have d"{x\y) > which implies positive definiteness of the Hessian matrix 
and hence convexity of the auxiliary function G(h|h) (convexity more simply derives from the fact that 
the auxiliary function is built as a sum of convex functions). 



Connections with other works The construction of G(h|h) employs standard mathematical tools 
(Jensen's inequality, Tay lor approximation) that are well known from the MM literature, see, e.g.. 



itv 

Hunter and Langell2004[ ). For (3 G [1, 2], G(h|h) coincides with the auxiliary function built bv iKompass 



2007p . This latter paper proposed itself a generalization of the auxiliary functions proposed bv lLee and Seung 
200l|) for the Euclidean distance (/3 = 2) and the generalized KL di vergence (0 = 1) . For /? = (IS 

T). 



divergence), G(h|h) coincides with the a uxiliary function propo sed bv lCao et al. (19991). It is worth re 
calling that in the algorithms proposed bv lLee and Seung (I2OOII) the update of W given H or H given W 
are instances of well known algorithms for image restoration (for which W acts as a fixed, known blurring 
matrix and H is a vectorized im age to be reconstructed). These algorithms a re the Iterative Space Re- 
constructing Algo rithm (ISRA ) of Daube-Witherspoon and Muehllehnei ( 1986f ) and the Richardson-Lucy 
(RL) algorithm of RichardsonI ( 1972t ): ILucvI ( 1974 ). which perform nonnegative linear regression with the 
Euclidean dis tance and KL div ergence, respectively. The ISRA and R L algorithms are s hown to be MM 
algorithms bv lDe Pierro ( 19931 ). Similarly, the algorithms proposed bv lCao et al. ( 19991 ) for nonnegative 
linear regression with the IS divergence were designed in the image restoration setting. Finally, let us 
mention that an auxiliary function based on Jensen's ine quality for NMF with the a-divergence (which 



is always convex w.r.t to its second argument) is given bv lCichocki et al.l (|2008l ). 



^Cj-(h) = d{vf\['Wh]f) is concave as the composition of a concave function and a linear function. 
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/3< 1 


1 < ^ < 2 


P>2 


7(/3) 


1 

2-0 


1 


1 

/3-1 



Tabic 2: Exponent in the multiplicative updates given by the MM algorithm. 



4 Algorithms for /3-NMF 

In this section we describe algorithms for /3-NMF based on the auxiliary function constructed in the latter 
section. In the following h should be understood as the current iterate h*^*^ and we are seeking to obtain 
h(*+i) such that Eq. ^ is satisfied. 



4.1 Maximization-Minimization (MM) algorithm 

An MM algorithm can be derived by minimizing the auxiliary function G(h|h) w.r.t to h. Given the 
convexity and the separability of the auxiliary function the optimum is obtained by canceling the gradient 
given by Eq. ([M]). This is trivially done and leads to the following update: 

~l3-2 \ 7(/3) 



hf^ = h, ( V, (38) 



where 7(/3) is given in Tabled Note that 7(/3) < l,V/3. As suggested in Section (TJ the gradient of the 
criterion may be written as the difference of two nonnegative functions such that 



V,.,C(h)=V+C(h)-V^^C(h) (39) 

V+C(h)=^u;/..t)^-i (40) 
/ 

VlC{\,)^Y.'"fkVfV^f^ (41) 
/ 

so that the update (|38p can be rewritten in the more interpretable form 

The conclusion is thus that the MM algorithm leads to multiplicative updates, bu t the latter differ from 
the "usual ones" , obtained by setting ^{P) = 1 for all /? and derived h eurist ically by Cichocki et al. ( 2006[ ) 



through gradient descent with adaptative step or by Fevotte et al.l (200^ by splitting the gradient into 



two nonnegative functions as discussed above and in Section [T] The MM update differs from the heuristic 
update by the exponent 7(/3) which is not equal to one for /3 ^ [1, 2]. 



4.2 Heuristic algorithm 

This s ection discusses the properties of the heuristic update proposed bv lCichocki et al.l (|2006l) : lFevotte et al 



20091) and defined for all /3 by 



(43) 



Very few mathematical results exist for the heuristic update when /3 falls outside [1,2], i.e., when the 
/3-divergence dp{x\y) is not convex. In such a case, the heuristic update can be erroneously interpreted 
as an MM algorithm by wrongly applying Jensen's inequality to C{h). Yet, in the particular case /3 = 0, 



it holds that each heuristic update produces a decrease of C(h) ( Cao et al. . 19991 ). One objective of this 



section is to extend this result to all values of /3 between and 1. 



Let us first introduce a scalar auxiliary function g{y\y]x) as follows: 

\ly,y,x>Q, g{y\y;x) = d{x\y) + d{x\y) + [y - y)d'{x\y) + d{x) (44) 

where d{x\y), d{x\y) and d{x\y) are defined in Table [T] By immediate application of Theorem[T]to the 
scalar case, g{y\y;x) is an auxiliary function to d{x\y). In particular, g{y\y;x) — d(x\y). Then, we have 
the following preliminary result. 
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Lemma 1. For all /3 e R, 




Gfe(/ifc|h) \ ywfki''f-' \g{hk\hk;hf) + cst. (45) 



Proof. For each of the four possible expressions of {d,d) given in Tabled the validity of (P5|) can be 
checked straightforwardly by direct verification. □ 

As already mentioned in Section 13.11 the MM update ([20]) is not the only way of taking advantage of 
the auxiliary function G'(h|h) to obtain a decrease of C(h): any update satisfying ([T5|) also ensures that 
C(h) does not increase. This is a key remark to understand the behavior of the heuristic algorithm for 
/3 £ (Oj l)i given the following property. 

Theorem 2. For all (3 6 (0,1), and all h such that conditions (i)-(ii) of Theorem [T] hold, the heuristic 
algorithm produces nonincrcasing values of C(h), according to the following inequality: 

G(h"|h) < G(h|h). (46) 

Proof. For all [3 £ (0, 1), straightforward calculations yield 

9{y\y;x) -g{x\y;x) = d{x\y) - d{x\x) - {x~y)d'{x\y) (47) 

^ y'^(l -/3 + /36i-6'^) (48) 



1-13 



where 6 = x/y. Since f{6) — B'^ is a concave function of 9, wc have f{d) < /(I) + {6 — 1)/'(1), which 
also reads O'^ < 1 + {9 — 1)13. Hence, g{y\y;x) — g{x\y;x) > for all x, y. The latter inequality implies 
Vfc, g{h^\hk,h^) < g{hk\hk,h^), so that we have Gk{hf\h) < Gk{hk\h) according to (^5]). which leads 
to the result by summation over k. □ 



Cao et al. show that inequality becomes an equality in the case /3 = 0, so that each 



heuristic update yields G(h^|h) — G(h|h). In this particular case, the heuristic algorithm can be called 
a "majorization-cqualization" algorithm, a class of algorithms described in next section. For values of 
f3 outside the range [0,2], inequality does not hold anymoreH Of course, this does not mean that 
the heuristic updates produce increasing values of G(h). On the contrary, numerical simulations tend to 
indicate that they always produce nonincreasing values of G(h), but proving this is still an open issue 
for /? [0, 2]. Compared to MM updates, heuristic updates produce larger or equal steps for all /3, since 
it can trivially be shown that 

Vfc, K-hk\>K'^-hk\. (49) 

For (3 ^ [1,2], numerical simulations indicate that the heuristic algorithm is faster than the MM algo- 
rithm (and we recall that the two algorithms coincide for /3 G [1, 2]). Given (P^ . skipping from the latter 
to the former has an effect comparable to that of overrelaxation: on the average, stretching the steps 
allows to reduce their number to reach convergence. This will be discussed in more details in Section [4.41 



In order to produce even larger steps for /3 S [0, 2], and yet nonincrcasing values of G(h), the following 
section explores the concept of majorization-cqualization. 

4.3 Majorization-Equalization (ME) algorithm 

Let us introduce the general notion of ME update by the fact that the new iterate h'^^ fulfills 

G(h^Elh) = G(hlh). (50) 

Eq. (j50p actually defines a level set rather than a single point. Let us concentrate on the following more 
constrained and manageable condition, given the separability of G(h|h): 

Vfc, Gk{hf^\h) = Gk{hk\h). 

Given (j45|) . this amounts to solve the following equation for y, for any y,x > 0: 

9{y\y;x) = g{y\y;x). (51) 



^Indeed, we can prove that the reversed inequahty holds for aU /3 < 0, while no systematic result is known for /3 > 2. 
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p \ u 


^ —1 P — ^ 


1 ^ p ^ z 


p z 
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2 


2 


1 


-1 


1/2 


3/2 


3 


2 


-2 


2/3 


4/3 


4 


3 


-3 


3/4 


5/4 


5 


4 



Table 3: Values of /3 for which ME updates are closed- form, by root extraction of polynomials of degree 
d. 

Since g{y\y;x) is strictly convex w.r.t y, (|5ip has not more than two solutions, one of them being y. By 
construction, the selection of the other solution (provided that it exists) will provide ME steps that are 
larger than MM updates, i.e., 

Vfc, Ihf^-hlylhf^-hi (52) 

as illustrated by Figure [H To go further on the determination of this solution, a case-by-case analysis 
must be performed, depending on the range of /3. 

Case 1: /? G [0, 1) In that case we have 

9{y\mx) = _^ ^ i3 ^y'^^^ yy^^^ + <^st. (53) 

Let us remark that 

Vy,x>0, \iv[\g{y\y]x)=]in\g{y\y;x)^(X), (54) 

y^Q y-^oc 

SO that (|5ip always admits two positive solutions (or one double positive solution if y x), one of the 
two being y = y. The other one is the solution of interest. However, it is not closed-form, except for 
specific values of /3 (see Table [3]). More precisely, when /3 = 1 — 1/d and d is an integer, the solution can 
be found by solving the following polynomial equation of degree d, for z = y^/'^: 

d 

(1 - /3) ^ i-^-^z^ - x = (55) 
(.=1 

where z = y^^'^. Not surprisingly, the simplest case /3 = (d = 1) leads us to y = x, and thus to 
= h^. The case /3 = 0.5 {d = 2) is more interesting. The extraction of the positive root of ([55]) then 
provides the following update formula: 




Let us remark that this expression does not correspond to a multiplicative update, although it ensures 
that positivity is maintained. 



Case 2: /? e (1, 2] In that case we have 

g{y\y; a;) = ^ / - ^~~T^^'^~^ + '^^'^^ 

g{y\y;x) tends toward oo for y — > oo, but it remains finite for ?/ — > 0. As a consequence, Eq. (j51|) 
only admits the trivial solution y — y if g{y\y]x) > g(0\y;x), and also the unwanted solution if 
g{y\y', x) = g(0\y; x). It is only when g{y\y', x) < g{0\y] x) that a positive, non trivial solution exists. This 
solution is closed- form for specific values of /3 given in Table |3l They correspond to /3 = 1 + l/d, where d 
is an integer. Eq. ([5T|) then amounts to solve the following polynomial equation of degree d, for z = y^^"^: 

d 

Y^~z'^-^z^ -{d+\)x^(), (58) 

£=0 

with z ~ y^^'^- The simplest case is /3 = 2 (d = 1), and the solution is then given by ?/ = 2a; — y if y < 2a;, 
which yields the overrelaxed update 

hf^^2hf^-hk, (59) 
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(a) /? = 0.5 (b) (5 = 1.5 (c) /3 = 2 




Figure 3: Normalized updates hk/hk as functions of /hk {0 = 0.9). The region between the dotted, 
horizontal line and the solid line correspond to the steps that fulfill Eq. (1181) . The larger departure from 
the horizontal line, the larger the step. 

provided that hk < 2h^^. Note that this result more simply stems from the fact that when (3 — 2 the 
auxiliary function is parabolic and thus symmetric with respect to h^^. In the case f3 ~ 1.5 (d — 2), a 
positive ME update exists if < 3h^^, and it takes the following form: 




uMM \ 

12^ 3-1 . (60) 

h 



k ) 

As we need an update strategy that is defined everywhere, we propose to rely on a linear mixture between 
the MM update and a prolonged version of ME, defined as 

hi = ehf^'' + (1 - 6)hf^ (61) 
where 9 G (0, 1) and prolongs the ME update by zero when the latter does not exist: 

^pME^Ur if is defined ^^2) 
^ 1 otherwise 

It is mathematically easy to check that h^. fulfills Eq. (jl8p for all 9 £ [0, 1], and that positivity is main- 
tained for all 9 G [0, 1). In practice, values of 9 near one may be favored to produce larger steps. 

When /3 < or /? > 2, similar analyses can be conducted. In particular, there are specific values of (3 
for which a closed- form expression of ME updates is available according to Table [31 

When /? < 0, ME updates always exist since (j53p and (|54p still hold. Moreover, they provide non- 
increasing values of C(h), while the latter monotonicity property is not yet proved for the heuristic 
algorithm. However, simulations tend to indicate that the heuristic algorithm is faster than the ME 
algorithm (which is itself faster than the MM algorithm) in the case /3 < 0. This is in conformity with 
the fact that ME steps can then be proved to be smaller than heuristic steps (on the basis of the reversed 
inequality mentioned in Footnote ^ . 

When /3 > 2, ME updates do not necessarily exist, akin to the case /? S (1,2]. When they exist, 
they provide nonincreasing values of C(h), while the latter is not yet proved for the heuristic formula. 
However, since this range of /? values is not of utter practical interest, we will not go further into a 
detailed analysis here. 

4.4 Overrelaxation properties of the heuristic and ME updates 

As already stated, the heuristic and ME updates produce larger steps than the MM update, i.e., — hk\ 
and — hk \ are larger than \h^^ ^ ^fc|, for all values of /3 G M. This is a form of overrelaxation, which 
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will be shown in Section [5] to produce faster convergence in practice. The normalized ME (or pME) and 



heuristic updates studied in the above sections can be written function of hf^/hu, such that 



ii^/fS!!) (63) 

where /ifc is either h^, or hi. For the heuristic update, the function is simply given by 

f{x) — x^^'''^^\ For (3 S {0.5,1.5,2}, the function / corresponding to the ME/pME update is easily 
derived from equations (|60p and ([59)1 . Figure [3] displays the latter functions for the updates studied 
in this work when /3 € {0.5,1.5,2}. Overrelaxation appears from the fact that hk < whenever 

h^^ < hk (steps towards left) and hk > hf^ whenever hf^ > hk (s teps towards right). 

General results about overrelaxation of MM algorithms are given bv lSalakhutdinov and Roweid ( 2003 ). 



and in particular in the case of NMF. The authors consider the specific case of KL divergence but their 
study holds for any divergence. They show that, in a neighborhood of a stationary point, for any rj G (0, 2), 
relaxed updates h^ of the form 



hf _ (hf^V 



hk \ hk 



(64) 



will converge to the same point than h^^, with a different, possibly better, rate of convergence. In 
particular, the optimal learning rate ry, providing the largest rate of convergence, can be computed from 
the eigenvalues of the Jacobian, at convergence, of the mapping that relates h^^ at iteration (i) to h^^ 
at iteration {i + \). The optima l learning rate is sho wn to be always greater or equal to 1. A similar 
result was recently obtained by iBadeau et aP (|2010h . However, these results do not translate into a 



practical algorithm, because the latter relaxation property only holds locall y, and the computation of 
the op timal learning rate requires the stationary point to be known. As such, Salakhutdinov and Roweid 
( 20031) propose an adaptive scheme which incrementally proposes values of rj greater than one at each 



iteration, and backtrack to 77 = 1 when the criterion ceases decreasing. 

Our results show that for /3 S (0, 1) the learning rate 77 = l/7(/3) = 2-/3, corresponding to the h e uristic 



update, ensures descent of the criterion everywhere. The results of ISalakhutdinov and Roweid ([20031) 
indicate that the learning rate can be increased to 77 = 2 when the algorithm approaches the solution. 
Note that in the neighborhood of the solution the Taylor approximation /(x) « /(l) + /'(l)(a; — 1) applied 
to f{x) = implies that 

{hf-hk)^vihf''-hk). (65) 

A similar approximation carried out with the ME/pME updates defined by equations ([551) . (j60p for 
/3 e {0.5,1.5} reveals that in these two cases /'(I) = 2 (and by construction /(I) = 1). so that, in a 
neighborhood of the solution we have 

ihf^-hk)^2ihf'^-hk). (66) 

This means that the ME algorithms produce the largest admissible learning rate 77 = 2 in the neighborhood 
of the solution, while avoiding to adapt the learning rate so as to ensure monotonicity of the criterion. 
This results holds everywhere for /3 = 2. see ([59]) . b v symmetry of the auxiliary function w.r.t to hj{^^. The 



interested reader may also refer to (jLanteri et al.l .i2001. 201(1 for relaxation of multiplicative algorithms 



using adaptativc learning rates computed through line search. 

4.5 Implementation and complexity of the algorithms 

As seen in previous section, the update rules of all the studied algorithms can be expressed as functions of 
the ratio V^^C(h)/V^^C(h), which dominates the algorithmic complexities. Fortunately, the latter ratio 
takes a simple matrix form that leads to efficient implementations. As such, getting back to the original 
factorization problem, the heuristic update (j43p for factors H and W can conveniently be expressed in 
the following matrix form 

W^[(WH)-(^-^).V] 
H^H. ^T[wH]-(/^-i) ^^^^ 

,V [(WH)-(^-^).V]H^ 
[WH]-(/3-i)H^ 

where the division •/• is here taken entrywise. The MM update simply involves bringing the correc- 
tive ratio to the power 7(/3), and the ME update involves applying a function specific to the value 
of 13. Hence, the algorithms have similar complexity 0{FKN) and their implementation take simple 
forms. MATLAB implementations of the algorithms discussed in this paper are available online at 



http : //perso .telecom-paristech.fr/~fevotte/Code/ code_beta_niiif . zip 
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5 Simulations 



In this section wc report performance results of /3-NMF algorithms for the specific values (3 e {0.5, 1.5, 2}. 
These values are chosen for their practical interest and because a simple ME algorithms exist in their 
case. As such this section will evidence the performance improvement brought by the ME approach over 
the MM or heuristic approaches, with similar computational burden. More precisely, the ME algorithm 
considered in this section is the mixture of prolonged ME and MM, defined by Eq. ((6T|) and with 6 — 0.95, 
but we will still refer to it as ME for simplicity. The algorithms for all three considered values of j3 are 
compared on small-sized synthetic data in Section 15.11 The algorithms for (3 = 0.5 are analyzed in 
Section 15.21 on the b asis of a small music transcription example as this specific va l ue of j3 has proven 
efficient for this task ( FitzGerald et al. . 2009t Vincent et al. . 2010l : Henneauin et al. . 2010[) . 



In the following results we will display the cost values through iterations as well as, following lGonzalez and Zhang 
(j2005l ) , "KKT residuals" . The residuals allow to monitor convergence to a stationary point and are here 
defined as 



KKT(W) = II min |w, [(WH)-^'^-^) (-"vyfj _ V)]H'^| ||i/i^i^ 
KKT(H) = ||min|H,W^[(WH)-('^-2) (^jj- V)]| ||i/A'iV 



(69) 
(70) 



They are meant to converge to zero, by Eq. ([T7)) . Again, the monotonicity of the heuristic, MM and ME 
algorithms does not imply convergence of the iterates to a stationary point. Hence, displaying the KKT 
residuals allows to experimentally check whether convergence is achieved in practice. 

One iteration of each algorithm consists of updating W given H^*"-'^' and H given W''', and then 
normalize W'*^ and H'*^ to eliminate trivial scale indeterminacies that leave the cost function unchanged. 
The normalization step consists of rescaling each column of W so that ||wfc||i ~ 1 and rescale the fc*'' 
row of H accordingly. The normalization step is not required per se but is useful to display and compare 
the KKT residuals, which are scale-sensitive. 



5.1 Factorization of synthetic data 

We consider a synthetic data matrix V constructed as V = W*H* where the ground truth factors are 
generated as the absolute values of Gaussian noise0 The matrix can be exactly factorized so that all 
algorithms should converge to a solution such that Z)(V|WH) = 0. The dimensions are = 10, = 25, 
K = 5. The algorithms (heuristic, MM, ME for f3 = 0.5, MM and ME for /3 G {1.5,2}) are run for 10^ 
iterations and initialized with positive random values. Fig. |4j [5] and |6] display for each of the 3 values of 
/3 the normalized cost values D(V\WH) / FN , the KKT residuals, as well as "fit residuals" computed as 
||W(') - W\\f/FK and ||H(*) - B.\\f/KN, where W and H are the factor estimates at the end of the 
10^ iterations and ||.||f is the Frobenius norm. The fit residuals allow to measure the closeness of the 
current iterates to their end value. 



The cost values in all three cases converge to zero as an exact factorization is reached (oscillations 
appear in the end iterations as machine precision is reached). Convergence is achieved in all three cases, 
as shown by both the cost values and KKT residuals. We visually inspected the factorizations returned 
by the algorithms. For each value of /3 G {0.5, 1.5}, the different algorithms appeared to converge to the 
same solution (and solutions obtained for the two values of /3 appeared comparable). This was less clear 
for (3 = 2, where ME appeared to reach out a different solution than MM. Still, in this run ME provides 
fastest convergence for every considered value of /3. Other runs, obtained from other starting points (ob- 
tained randomly), tend to show that when the compared algorithms converge to the same solution, then 
ME converges faster. Convergence to a common solution can be controlled in the specific case where W 
is fixed and /3 G {1-5, 2}, because the objective function is then convex w.r.t H. In this scenario ME was 
found to always converge faster than MM. These simulations are reported in the companion report avail- 



able online at http : //perso . telecom-paxistech . f r/~f evotte/Samples/necoll/beta_ninf _supp . pdf 



The fit residuals in Fig. HI [S] and [H] show that full convergence will not need be attained to obtain 
satisfying solutions for most applications as the fit residual will be considered sufficiently small after a few 
hundred iterations. Note that the factor iterates do not necessarily converge to the ground truth values 
W* a nd H* (and this is what we observed indeed) because of the identifiability ambiguities inherent to 



NMF ( Donoho and Stoddenl . l2004HLaurberg et al.l . 120081) . 



*E.g., in MATLAB notation V = abs(raiidn(F,K))*abE(randn(K,N)). 
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Figure 4: One run of the heuristic (H), ME and MM algorithms on synthetic data with (3 = 0.5. Loga- 
rithmic scales for both x- and y- axes. 



Using a MATLAB implementation run on Mac 2.6 GHz with 2 GB RAM, the CPU time required 
by each algorithm for the 10^ iterations is about 60 s for /? G {0.5,1.5} and 20 s for ^ = 2, including 
the computation of the cost values and KKT residuals. The ME algorithm is marginally more expensive 
than MM, itself only slightly more expensive than the heuristic algorithm, for /3 = 0.5. The CPU times 
incurred by the algorithms when /? = 2 is considerably lower thanks to simplifications in Eq. (j67[) and (j68[) . 
Indeed, in latter case the term (WH)H"'" appearing at the denominator can more efficiently be computed 
as W(HH"'"), which involves a multiplication of matrices with smaller sizes. 



5.2 Audio spectrogram decompostion 

This section addresses the comparison of the heuristic, MM and ME algorithms for /3 — 0.5 applied to 
an audio spectrogram. We consider the short piano sequence of ( Fevotte et al. . 20091) . recorded in live 



conditions, composed of 4 musical notes, played all at once in the first measure and then played by pairs 
in all possible combinations in the subsequent measures. A magnitude spectrogram of the audio signal 
is computed, leading to nonnegativc matrix data V of size F = 513 frequency bins by iV = 674 time 
frames. The data is represented top-left of Fig. [71 

As discussed in ( Fevotte et al.l . 120091 ) . K was set to 6 so as to retrieve in W the individual spectra of 



each of the 4 notes and supplementary spectra corresponding to transients and residual noise. The three 
algorithms were initialized with common positive random values and run for 10^ iterations. Figure [7] 
displays the cost values and KKT residuals along the 10^ iterations. It was manually checked that the 
algorithms converged to the desired "ground-truth" solution, i.e., the notes, transients and residual noise 
spectra are correctly unmingled. The three plots show that the ME provides fastest convergence overall 
though, judging from the KKT residuals, it appears that convergence is not achieved within the 10^ 
iterations. However, the musical pitch values (computed from W at every iteration) converge to their 
ground truth values after only 30, 50 and 580 iterations for ME, heuristic and MM, respectively. Other 
initializations yielded two types of results. In a minority of cases, either the heuristic and MM update, 
on one side, or the ME update, on the other side, converged to a local solution. In the large majority of 
cases the three algorithms converge to the same solution and the results are similar to those of Figure [T] 
the heuristic algorithm produces largest decreases of the objective function in the early iterations and 
is then supplanted by ME. In some runs the pitch values converged faster with the heuristic algorithm 
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Figure 5: One run of the ME and MM algorithms on synthetic data with /3 = 1.5. Logarithmic scales for 
both X- and y- axes. 



than with ME, and it was found that MM is generally slower than the other two algorithms. These 
results suggest a mixed update of the form of Eq. (|5T|) where the mixture parameter 9 could be made 
iteration-dependent so as to give more weight to the heuristic update in the early iterations and then to 
ME. 



5.3 Face data decompostion 

Finally, in this section we consider decomposition of face data using /?-NMF. We use the Olivetti dataset, 
composed of 10 grayscale 8 bits 64 x 64 face images of 40 people. We retrieved the data in MATLAB 
format from http : //cs . nyu . edu/~roweis/data . html. The images are vectorized and form the columns 
of V, with dimensions F = 4096 and N = 400. Fig. |8] displays the objective functions of one run of the 
ME and MM algorithms for j3 e {1.5, 2}, and illustrate the faster convergence of ME. Other runs led to 
sensibly similar plots. 

As stated in the introduction, /3-NMF is popular in audio signal processing where the value of /? can 
be controlled so as to improve transcription or separation accuracy. The idea of tuning the value of 
P so as to optimize performance applies to any NMF-bascd method for any type of data. As such, to 
mot ivate the use of /3-NMF in a non-aud io setting we propose an image interpolation cxemplc, inspired 
by (ICichocki et all . i2008t ICemgij l2009l) . where we show the influence of 13 on the reconstruction of 
missing data. We discard 25 % of the Olivetti data randomly and produce NMF decompositions using 
the available data for /? G {—1,0,1,2,3} and K G {50,100,200}. Accounting for the missing data 
requires minor modifications in the algorithms, basically multiplying V and its approxirnate WH with a 
binary mask in which zeroes indicate r nissin g pixels, see Jlfol . 120081: ICichocki et all . 120081: iLe Roux et al " 



20081 ICemgill I2OO9I: ISmaragdis erahl . l2009l ) for similar setups. For simplicity we only considered the 
MM algorithm, as it is consistently defined for all values of /3. It was run from 5 different initializations 
for every combination {K,f3), and the factorization yielding lowest end cost value was selected. Fig. [§] 
displays the original image, missing pixels and reconstructions for two of the images in the dataset. 
Wc have also computed the Peak Signal to Noise Ratio (PSNR) between original and reconstructed 
imagesH The maximum mean PSNR value (averaged over all 400 images) is obtained for /3 ~ 2 and 



^PSNR is a standard evaluation criterion in image reconstruction, defined as 20 logj^Q(FP/||v — v||2), where v and v 
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K = 200. However, since the PSNR value is equivalent to the Euclidean distance between the original 
and reconstructed image, it is expected that the optimal value of /3 is biased towards the metric used to 
assess the quality of reconstruction. Perceptually, we often found the reconstruction obtained with /3 = 3 
to be more satisfying than with /3 = 2. 

6 Variants of ^-NMF 

In this section we briefly discuss how some common variants of NMF, penalized NMF and convex-NMF, 
can be handled under the /^-divergence. 

Penalized /3-NMF Supplementary functions of W and/or H arc often added to the cost function (jS]) 
so as to induce some sort of regularization of the factor estimates or so as to reflect prior belief, e.g., in 
Bayesian maximum a posteriori (MAP) estimation. When such penalty terms are separable in the columns 
of H or in the rows of W, penalized NMF essentially amounts to solving the following optimization 
problem: 

min Cp(h) =^ L>(v|Wh) + i(h) subject to h > (71) 

h 

where L{h) is the penalty term. An auxiliary function to Cp(h) is readily given by 

Gp(h|h) =^ G(h|h) + i(h) (72) 

where G(h|h) is any auxiliary function to C{h) = Z?(v|Wh). MM or ME algorithms can then be designed 
on a case-by-case basis. Let us consider a short example for illustration: £i-norm regularization. In that 
case we have 

L(h) = A^/ife (73) 

k 

denote the vectorized original and reconstructed images, and P is the maximum pixel possible value {P = 255 in our case). 
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log-magnitude spectrogram log(V) 



P = 0.5 - Normalized cost values 




where A is a positive weight parameter. Using the auxiliary function designed in Section [3.2l and Eq. (|36p. 
the gradient of the penahzed auxiliary function writes 



V^,GL(h|h) = 

/ 



d' { ^/l^/^ ) + d'{vf\vf) 



+ X. 



The MM algorithm for ^i-regularized /3-NMF takes a very simple form for /? < 1, such that 



7(/3) 



(74) 



This in particular leads to £i-regularized NMF algorithms for KL-NMF and IS-NMF with proven mono- 
tonicity. An update similar to Eq. (|74)) is obtained for > 2 but the A term appears through its sign 
opposite at the numerator, instead of appearing at the denominator. Hence the nonnegativity constraint 
m ay become active and must be treated carefully: in t hat case our result coincides with similar findings 
of Pauca et al. ( 2006[) : lM0rup and ClemmensenI (|2007t ) for the specific case of ^i-regularized NMF with 
the Euclidean distance {(3 = 2). In the case /3 € (1, 2) the MM algorithm does not come up with a simple 
closed-form update, which supports the fact in the penalized case handy algorithms may only come on a 
case-by-case basis. This is similar to Expectation-Maximization (EM) procedures for MAP estimation, 
in which the E-step is essentially unchanged but where the M-step might become intractable because of 
the penalty term. ME algorithms can also be designed for the ^i-regularized problem and as a matter 
of fact it can be shown that the results of Table [3] (i.e., the values of /? for which a closed-form update 
exists) still hold in that case. 



Convex /3-NMF In some recent NMF-related works the dictionary W is constrained to belong to a 
known subspace S G K;^^*^ such that 



W = SL 



(75) 
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p = 1 .5 - Normalized cost values P = 2 - Normalized cost values 




Figure 8: One run of the MM and ME algorithms on the Ohvetti dataset with /3 = 1.5 (left) and (3 = 2 
(right). Logarithmic scales for both x- and y- axes. 




K / 13 -I 1 2 3_ 

50 28.3 30.7 32.2 31.8 31.2 

100 26.6 30.6 32.5 33.2 32.3 

200 29.1 31.2 32.4 32.7 32.1 



K I P -I 1 2 3_ 

~~50 20 27J 29^6 29^5 28^8 

100 26.2 28.3 29.3 29.8 29.8 

200 27.4 28.3 29.0 30.0 30.0 



Figure 9: Interpolation results with the Olivetti dataset. Original and corrupted data are shown top 
left of each plot. Below are the reconstructions obtained for K G {50, 100, 200} and /3 e { — 1, 0, 1, 2, 3}. 
Tables report PSNRs (in dB) of the reconstructions. 



where L e M^^""^'. For example IPhig et all (|2010[) assume the columns of W to be linear combinations 
(with unknown expansion coefficients) of data points (colu mns of V) , so as to enforce the dictionary to 
be composed of data centroids, while IVincent et al. (j2010[) assume the dictionary element to be linear 
combinations of narrow band spectra, so a s to enforce harm onicity and smoothness of the dictionary. 
The term "convex- NMF" was introduced bv lDing et al. ( 2010l ) to express the idea that W belongs to the 
convex set of all nonnegative linear combinations of elements of S, but this does not make the optimiza- 
tion problem convex in itself, in the general case. 



In this setting, the dictionary update is tantamount to solving 

min Cct,(L) =^ L»(V|SLH) = '^d{ Vfnl'^Sfjnlmkhkn | subject to L > 0. (76) 

fn \ 7nk / 

As a matter of fact, this matricial optimization problem can be turned into vectorial nonnegative linear 
regression so that the results of Section |3] holds. Given some mappings {f,n) G {1,^"} x {1,-^} — > p G 
{1, FN} and (m, fc) G {1, M} x {1, iC} ^ q G {1, MK} let us introduce the following variables: T is the 
matrix of dimension FN x MK with coefficients tpq = Sfmhkn, y is the column vector of size FN with 
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coefficients = w/n, 1 is the column vector of size MK with coefficients = Imk- Then we have 

D{V]SLH)=Y,dUp\J2t,,l_A (77) 
p \ g I 

and thus the estimation of L amounts to the approximation v « Tl. As such, any of the algorithms 
described in Section |4] can be employed for this task. As before, the resulting vectorial updates can be 
turned into matricial updates, leading to simple and efficient implementations. For example, the MM 
update reads 

/ S^ [(SLH)-(^-^).V]Hn 
^^^\ [(SLH)-(^-i)] j ■ ^^^^ 



This result proves the monotonicity of some of t he al gorithms derived heuristically in (jVincent et al 



the more general /3-divergenceQ 



2010^ and also extends the results of ( Ding et al. . 2010h for convex NMF with the Euclidean distance to 



7 Conclusions 

This paper has addressed NMF with the /3-divergence. The problem may be reduced to a mere nonneg- 
ative linear regression problem and our approach is based on the construction of an auxiliary function 
G(h|h) which majorizes the objective function C(h) everywhere and is tight for h = h. Th e auxiliary func- 
tion u nifies existing auxiliary functions fo r the Eucl i dean d istance and the KL divergence ( Lee and Seund . 



200l[ ). for the "generalized divergence" of Kompas^ (|2007|) (i n essence the /3-divergence on its convex part 



i.e., /3 e [1,2]) and for the IS divergence (jCao et al.L 1999h . Various descent algorithms, free of tuning 
parameters, may then be derived from this auxiliary function. As such, the findings of this paper may 
be summarized as follows. 



The MM algorithm based on the described auxiliary function is shown to yield multiplicative al- 
gorithms for /3 e M, as described by Eq. (|38)) . For /3 G [1,2] (interval of values for which the /3- 
divergence is convex), the MM algorith m coincides with the heuristic algorithm given by Eq. (|43p . 



as already known from iKompassI (|2007t ) 



In Section 14.21 wc prove the monotonicity of the heuristic algorithm for /? S (0, 1) by proving the 
inequality G(h^|h) < G'(h|h). Hence, aggregating the existing monotonicity results for /3 = and 
/3 G [1,2], it can now be claimed that the heuristic algorithm is monotone for /3 G [0,2], which is 
the range of values of practical interest that has been considered in the literature. 

In Section 14.31 we introduced the concept of maximization-equalization (ME) algorithms. Such 
algorithms are exhibited for specific values of /3, in particular for /? e {0,0.5,1.5,2} which are 
values of practical interest. For /? = (IS divergenc e) the ME algorit hm coincides with the heuristic 
algorithm, whose monotonicity already holds from (jCao et al. . 119991) . For other values of /? the ME 



algorithms arc nonmultiplicative. For /? G {0.5, 1.5, 2} they amount to solving polynomial equations 
of order 1 or 2. The result section has illustrated the faster convergence of the ME approach w.r.t 
to MM or heuristic, with equivalent complexity. 

Finally, in Section |6] we have considered variants of NMF with the /3-divergence. We have explained 
how penalty terms may be handled in the auxiliary function setting; in particular we have presented 
simple multiplicative algorithms for l\ regularized KL or IS NMF. Then, we have shown how 
the algorithms constructed for plain NMF holds for convex-NMF, generalizing and proving the 
monotonicity of existing algorithms. 



As for perspectives, the present work leaves two important questions unanswered. The first one is 
the monotonicity of the heuristic algorithm for /3 ^ [0,2]. The monotonicity is observed in practice but 
we have not been able to come up with proofs in the presented setting. Either other approaches need to 
be followed or a different type of auxiliary functions than the one presented here needs to be envisaged. 
As suggested in Section 12. 1[ the convex-concave decomposition of the /3-divergence is not unique and 
decompositions other than the "natural" one employed in this paper may lead to auxiliary functions that 

^ More precisely, iDing et al. consider a "semi" -NMF version where S = V and the data is allowed to be real- valued 

while the nonnegativity constraint is solely imposed on L and H; our results do not apply to this more general framework 
but only to the special case where V in nonncgativc. 
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more closely fit to the criterion. The second, probably more ambitious question is the convergence of 
the sequence of it erates prod uced by the proposed algorithms a stationary point. Partial results exist for 
Euclidean NMF (jLinl . l2007al) . convergence of multiplicative rules for nonnegative linear r egression (i.e. , 
when only one of the two matrices is updated) has b een studied in a few cases, see, e.g., ( Titteringtonl . 
1987 : De Pierro . 1993tlEggermont and LaRicciaL 1998 ). but general re sults for NMF with th e /3-divergence 



are still lacking. A noteworthy attempt has recently been made by iBadeau et al. I (l201tt . which points 
difficulties in the convergence study due to the inherent scale ambiguity of factorization models. 

Finally, another relevant perspec tive i s the de sign of new types of /3-NMF algorithm s. In the Euclidea. n 
case, projected gradient methods (iLinl. 2007bl), seco nd-order active sets methods ( Kim et al. . 20081) . 



block-coordinate descent metho ds (IMairal et al.l. 1201(1) hav e recently been shown to outperform standard 
multiplicative updates, sec also ( Morup and HansenL l2009l ) for a comparison of a selection of algorithms. 
As such it would be interesting to study how these approaches may extend to the more general /J-NMF 
framework. 
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